Abstract

Recent advances in cryptography promise to enable secure statistical computation
on encrypted data, whereby a limited set of operations can be carried out without
the need to first decrypt. We review these homomorphic encryption schemes
in a manner accessible to statisticians and machine learners, focusing on pertinent 
limitations inherent in the current state of the art. These limitations restrict the 
kind of statistics and machine learning algorithms which can be implemented and 
we review those which have been successfully applied in the literature. Finally, 
we document a high performance R package implementing a recent homomorphic 
scheme in a general framework. 

Keywords: homomorphic encryption; data privacy; encrypted statistical analysis; 
homomorphic encryption R package. 


1 Introduction 


The extensive use of private and personally identifiable information in modern statistical
(and machine learning) applications can present an obstacle to individuals contributing
their data to research. As just one example, when considering contribution to biobanks,
Kaufman et al. (2009) reported that 90% of respondents had privacy concerns. Addressing
these concerns is paramount if the participation rate in biomedical and genetic research
is to be increased, especially for government and industry where public trust is lower
(Kaufman et al., 2009). Indeed, industry is on the brink of embarking on biomedical
applications on a scale never before witnessed via the impending wave of so-called
‘wearable devices’ such as smart watches, which present serious privacy concerns. Companies
hope to market the ability to monitor and track vital health signs round the clock, perhaps
fitting classification models to raise alerts for different health concerns of interest. However,
such constrained devices will almost certainly leverage ‘cloud’ services, uploading reams
of private health diagnostics to corporate servers. Herein, it is demonstrated how recent
advances in cryptography allow individual privacy to be preserved, whilst still enabling
researchers and industry to incorporate such data into statistical analyses.

Moreover, the current explosion in cloud computing platforms promises to enable
researchers and businesses to divest themselves of complex in-house compute server setups,
but requires one to vest all trust in the cloud provider maintaining confidentiality of the
data.

One way to ensure trust in the scenarios above is through storage and disclosure of 
only secure, encrypted data. Encryption is a technique whereby data, termed a message 
in cryptography, is mathematically transformed using an encryption key to produce a
cipher text. The cipher text can only easily be decrypted to reveal the original data if the 
corresponding decryption key is known. Therefore, a cipher text can be stored openly 
without compromising privacy so long as the decryption key is kept secret. 

From a data science perspective, the problem with employing cryptographic methods
to improve trust is that the data must at some point be decrypted for use in a statistical
analysis. However, recent cryptography research in the areas of homomorphic and
functional encryption is showing exciting potential to bypass this. An encryption scheme is
said to be homomorphic if certain mathematical operations can be applied directly to the
cipher text in such a way that decrypting the result renders the same answer as applying
the function to the original unencrypted data.

The remarkable properties of homomorphic encryption schemes are not without
limitations, which typically include slow evaluation and the fact that the set of functions
which can be computed in cipher text space is very restricted. However, by understanding
the constraints and restrictions it is hoped that statistics researchers can assist in the
research effort, adapting statistical techniques to be amenable to homomorphic
computation by making and quantifying reasonable approximations in those situations where a
traditional approach cannot be implemented homomorphically.

There are reviews and introductions to homomorphic encryption aimed at different
audiences and each with a different emphasis (Gentry, 2010; Vaikuntanathan, 2011; Sen,
2013; Silverberg, 2013). The aim of this paper is to provide statisticians and machine
learners with sufficient background to become involved in developing methodology
specifically crafted to homomorphic computation. As part of this effort we describe an
accompanying high performance R package providing an easy to use reference implementation as
a core contribution of this work. In a sister publication (Aslett, Esperança and Holmes,
2015) we present some novel statistical machine learning techniques developed to be
amenable to fitting and prediction on encrypted data.

In Section 2 homomorphic encryption is introduced covering the salient features for
statistical work without drifting too far into cryptography theory unnecessarily, although
full references and resources are provided for further reading. Section 3 reviews the
statistical techniques which have been successfully implemented in the cryptography literature
and existing software implementations of homomorphic schemes. Section 4 describes a
high-level easy to use software implementation available as an R package (Aslett, 2014).


2 Homomorphic encryption 

This section presents an introduction to homomorphic encryption with an emphasis on 
details and limitations which are pertinent to applying statistics and machine learning 
methodology. 

2.1 Background on encryption 

An unencrypted number, $m \in M$, is referred to as a message, while the encrypted version,
$c \in C$, is the cipher text, where $M$ and $C$ are the message space and cipher text space
respectively. Typically $M \subset \mathbb{Z}$, the integers or similar, whilst $C$ will depend on the
encryption algorithm being used. A given encryption scheme then utilises keys in order
to map the message into a cipher text and to recover the message from a cipher text.
There are two approaches: either there is a single secret key, or there are a public and
secret key. In the single secret key scheme the same key is used to map messages to
cipher texts and vice versa, so this key must be kept private at all times. Conversely, a
scheme which also has a public key uses that key to map messages to cipher texts, but
uses the secret key to map back: consequently the public key can be openly disclosed.
Hereinafter, only public key schemes are considered.

Fundamentally encryption can be treated as simply a mapping which takes $m$ and a
public key, $k_p$, and produces the cipher text, $c \leftarrow \mathrm{Enc}(k_p, m)$. Notationally, $\leftarrow$ is used to
signify assignment rather than equality, since encryption is not necessarily a function in
the mathematical sense: any fixed inputs $k_p$ and $m$ will produce many different cipher
texts. Indeed, this is a desirable property for public key encryption schemes, referred to
as semantic security: a scheme is semantically secure if knowledge of $c$ for some $m$ has
vanishingly small probability of revealing further information about any other encrypted
message. Informally, this means repeated encryption of the same message $m$ will render
different and seemingly unrelated cipher texts each time with high probability. Clearly,
if encryption was an injective function for fixed $k_p$, $\mathrm{Enc} : M \to C$, then any public key
encryption scheme with a modestly sized message space could be trivially compromised.
Semantic security is achieved by introducing randomness into the cipher text which is
sufficiently small not to interfere with correct decryption when in possession of $k_s$, but,
as will become apparent in the sequel, this essential feature imposes a handicap on all
currently known homomorphic schemes.

Conversely, decryption is a function which renders the original message, $m = \mathrm{Dec}(k_s, c)$.
The crucial relation satisfied by any encryption scheme is therefore:

$$ m = \mathrm{Dec}(k_s, \mathrm{Enc}(k_p, m)) \quad \forall\, m \in M $$


Consequently, the security of an encryption scheme is based on the hardness of
recovering $m$ given knowledge of only $c$ and $k_p$. Some schemes are based on empirical hardness
assumptions about particular problems, whilst others may rely on settings where the
hardness can be rigorously proven.

This is a simplification of general cryptographic schemes, since some of the most
important algorithms, such as the current industry standard Advanced Encryption Standard
(AES) (Daemen and Rijmen, 2002), do not normally operate value-by-value but rather
on blocks of binary data. However, it encompasses the class of algorithms to be discussed
in what follows.


2.2 Homomorphic encryption 

The term homomorphic encryption describes a class of encryption algorithms which
satisfy the homomorphic property: that is, certain operations, such as addition, can be
carried out on cipher texts directly so that upon decryption the same answer is obtained 
as operating on the original messages. In simple terms, were one to encrypt the numbers 
2 and 3 separately and ‘add’ the cipher texts, then decryption of the result would yield 5. 
This is a special property not enjoyed by standard encryption schemes where decrypting 
the sum of two cipher texts would generally render nonsense. 

More precisely, an encryption scheme is said to be homomorphic for some operations
$\circ \in \mathcal{F}_M$ acting in message space (such as addition) if there are corresponding operations
$\diamond \in \mathcal{F}_C$ acting in cipher text space satisfying the property:

$$ \mathrm{Dec}(k_s, \mathrm{Enc}(k_p, m_1) \diamond \mathrm{Enc}(k_p, m_2)) = m_1 \circ m_2 $$

For example, the simple scheme in Gentry (2010) describes a method where $\mathcal{F}_M =
\{+, \times\}$ and $\mathcal{F}_C = \{+, \times\}$, though there is no restriction that the operations must
correspond in all schemes. For example, Paillier encryption (Paillier, 1999) is homomorphic
only for addition, with $\mathcal{F}_M = \{+\}$ but where $\mathcal{F}_C = \{\times\}$.
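To make the Paillier remark concrete, the following is a minimal sketch of a toy Paillier scheme. The primes, the function names and the choice of generator $n + 1$ are illustrative assumptions, not taken from the text, and the key size is far too small to be secure. It shows that multiplying cipher texts adds the underlying messages, and that encryption is randomised.

```python
import math
import random


def keygen(p=293, q=433):
    """Toy Paillier keys from two small primes (illustrative, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """c = (1 + n)^m * r^n mod n^2, with fresh randomness r on every call."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, secret, c):
    """Recover m from c using the secret pair (lam, mu)."""
    lam, mu = secret
    ell = (pow(c, lam, n * n) - 1) // n
    return (ell * mu) % n


n, secret = keygen()
c2, c3 = encrypt(n, 2), encrypt(n, 3)
# Multiplication in cipher text space corresponds to addition of messages.
total = decrypt(n, secret, (c2 * c3) % (n * n))
```

Here `total` decrypts to the sum of the two messages, while two separate calls to `encrypt(n, 2)` will, with high probability, return different cipher texts.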

Note this is not a group homomorphism in the mathematical sense, since the property 
does not commute when starting instead from cipher texts, due to semantic security. That 
is, because the same message encrypts to different cipher texts with high probability, in 
general: 

$$ \mathrm{Enc}(k_p, m_1) \diamond \mathrm{Enc}(k_p, m_2) \ne \mathrm{Enc}(k_p, m_1 \circ m_2) $$

Moreover, generally $m_1 > m_2 \nRightarrow \mathrm{Enc}(k_p, m_1) > \mathrm{Enc}(k_p, m_2)$. Another consequence of
semantic security is that operations performed on the cipher text may increase the noise 
level, so that only a limited number of operations can be consecutively performed before 
the noise must be reduced. 

The possibility of homomorphic encryption was proposed by Rivest, Adleman and
Dertouzos (1978) and many schemes that supported either multiplication (such as RSA
(Rivest, Shamir and Adleman, 1978), ElGamal (ElGamal, 1985), etc) or addition (such
as Goldwasser-Micali (Goldwasser and Micali, 1982), Paillier (Paillier, 1999), etc) were
found. However, in many of these the number of times one could add or multiply was
limited and a scheme supporting both operations simultaneously was elusive (Boneh et al.
(2005) came closest, allowing unlimited additions and a single multiplication). It was not
until 2009 that the three decade old problem was solved in seminal work by Gentry (2009),
where he showed addition, multiplication and control of the noise growth were all possible.
This sparked a cascade of work on fully homomorphic schemes: that is, those where a
theoretically unlimited number of addition and multiplication operations are possible.
This modern era of homomorphic encryption is briefly summarised in Appendix 

The advent of a scheme capable of evaluating both addition and multiplication a 
(theoretically) arbitrary number of times led to a surge of optimism, since then any 
polynomial can be computed and so the output of any suitably smooth function could 
in principle be arbitrarily closely approximated. Moreover, if M = {0,1} then addition
corresponds to logical XOR, and multiplication corresponds to logical AND, which is 
sufficient to construct arbitrary binary circuits so that, in principle, anything which 
can be evaluated by a computer can be represented by an algorithm which will run on 
homomorphically encrypted data. However, caution is needed here regarding practicality: 
performing just a 32-bit integer addition using a simple ripple-carry adder design involves 
32 full adders, each requiring 3 XORs, 2 ANDs and an OR (= 2 XOR & 1 AND) — 
256 fundamental operations just to add two integers, an avenue it will become clear is 
impractical with current homomorphic schemes. 
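The cost of building integer arithmetic from binary operations can be sketched as follows. This is an illustrative plain (unencrypted) model in which XOR is addition mod 2 and AND is multiplication mod 2; the class and function names are assumptions for the example. The tally depends on how the full adder is decomposed, so the count from this particular layout need not match the figure quoted above, but either way a single 32-bit addition needs a few hundred fundamental operations.

```python
class GateCounter:
    """Counts the XOR / AND evaluations used to build larger circuits."""

    def __init__(self):
        self.xor_count = 0
        self.and_count = 0

    def XOR(self, a, b):
        self.xor_count += 1
        return (a + b) % 2   # addition in M = {0, 1}

    def AND(self, a, b):
        self.and_count += 1
        return (a * b) % 2   # multiplication in M = {0, 1}

    def OR(self, a, b):
        # a OR b = a XOR b XOR (a AND b): 2 XORs and 1 AND
        return self.XOR(self.XOR(a, b), self.AND(a, b))


def full_adder(g, a, b, carry_in):
    """One-bit full adder built only from the gates above."""
    partial = g.XOR(a, b)
    total = g.XOR(partial, carry_in)
    carry_out = g.OR(g.AND(a, b), g.AND(partial, carry_in))
    return total, carry_out


def ripple_add(g, x, y, bits=32):
    """Add two non-negative integers modulo 2**bits with a ripple-carry adder."""
    carry, result = 0, 0
    for i in range(bits):
        s, carry = full_adder(g, (x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result


gates = GateCounter()
answer = ripple_add(gates, 123456789, 987654321)
operations = gates.xor_count + gates.and_count
```

Each of these gate evaluations would be a homomorphic addition or multiplication on cipher texts in an encrypted setting.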

A slightly whimsical but highly lucid and more detailed introduction to homomorphic
encryption can be found in Gentry (2010). A longer introduction and background is in
Sen (2013).


2.3 The scheme of Fan and Vercauteren (2012)


To make these ideas more concrete the particular scheme of Fan and Vercauteren (2012)
(hereinafter FandV) will now be described. A high performance, easy to use
implementation of the same is a contribution of this technical report as discussed in Section 4.

FandV is a fully homomorphic scheme where the message space accommodates
representation of large subsets of $\mathbb{Z}$ (not just binary messages), and a cipher text is a pair of
large polynomials. Its security is based on the hardness of the ring Learning With Errors
(LWE) problem (Lyubashevsky et al., 2010) which is connected to classical cryptography
hardness results (such theory would be a diversion: for a short description see Appendix B).


To simplify the presentation for a statistics audience, some minor simplifying
restrictions are made to the original scheme as will be explained. The reader may safely skip to
Section 2.4 if the following mathematical details of this example encryption scheme are
not of interest.


2.3.1 Notation 


$\mathbb{Z}_q$ is the set of integers $\{n : n \in \mathbb{Z}, -q/2 < n \le q/2\}$ and $[a]_q$ denotes the unique integer
in $\mathbb{Z}_q$ which is equal to $a \bmod q$. $\mathbb{Z}[x]$ and $\mathbb{Z}_q[x]$ denote polynomials whose coefficients
belong to $\mathbb{Z}$ and $\mathbb{Z}_q$ respectively. Then, for a fixed value $d$, the primary objects of
interest in the scheme are the polynomial rings $R = \mathbb{Z}[x]/\Phi_{2^d}(x)$ and $R_q = \mathbb{Z}_q[x]/\Phi_{2^d}(x)$,
where $\Phi_{2^d}(x) = x^{2^{d-1}} + 1$ is the $2^d$-th cyclotomic polynomial.^1 The restriction to $2^d$-th
cyclotomic polynomials here is for the convenience of their form, the computational
efficiencies of reducing a polynomial modulo this form, and for the simplicity of generating
random polynomials modulo this form which satisfy ring LWE hardness results (although
theoretically FandV can be modulo any monic irreducible polynomial).

To distinguish polynomials, they will be underscored, $\underline{a} \in R_q$, if not written in
functional form, $a(x)$. Polynomial multiplication will be emphasised, $\underline{a} \cdot \underline{b}$, and all such
multiplication takes place within the ring $R$. $[\underline{a}]_q$ indicates the centred reduction above applied
to each coefficient of $\underline{a}$ individually, so that $\underline{a} \in R \implies [\underline{a}]_q \in R_q$.
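The two basic operations of this notation, centred reduction and multiplication within the ring $R$, can be sketched as below. This is an illustrative pure-Python version using coefficient lists (lowest degree first); the function names and the schoolbook multiplication are assumptions for the example, not the optimised routines a real implementation would use.

```python
def centred_mod(a, q):
    """[a]_q: the representative of a mod q lying in (-q/2, q/2]."""
    r = a % q
    return r - q if r > q // 2 else r


def poly_centred_mod(poly, q):
    """Apply the centred reduction to each coefficient individually."""
    return [centred_mod(c, q) for c in poly]


def poly_mul(a, b, n):
    """Multiply two length-n coefficient lists in Z[x] / (x^n + 1).

    With n = 2^(d-1) the modulus x^n + 1 is the 2^d-th cyclotomic
    polynomial, so reduction is just the substitution x^n = -1.
    """
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] += ai * bj
            else:
                out[k - n] -= ai * bj  # wrap around with a sign flip
    return out


# x^3 * x = x^4 = -1 in Z[x] / (x^4 + 1)
wrapped = poly_mul([0, 0, 0, 1], [0, 1, 0, 0], 4)
```

The sign flip on wrap-around is what makes reduction modulo this particular polynomial so cheap.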

The randomness to be introduced for semantic security comes via the bounded
discrete Gaussian distribution, defined to be the probability mass function proportional to
$\exp(-x^2/(2\sigma^2))$ over the integers from $-B$ to $B$, where typically $B \approx 10\sigma$. For the special
choice of polynomial modulo $\Phi_{2^d}(x)$ above, the corresponding multivariate distribution,
denoted $\chi$, on $R$ then involves simply generating each coefficient of $x^n$, $0 \le n \le 2^{d-1} - 1$,
from a bounded discrete Gaussian distribution. This simple sampling procedure arises
due to the modulo $\Phi_{2^d}(x)$, which ensures that the coefficients are all independent
after modular reduction. Reducing modulo an arbitrary monic irreducible polynomial can
introduce dependencies between coefficients, so that the assumptions underlying the
hardness results of ring-LWE (Lyubashevsky et al., 2010) are no longer satisfied, leading to more
complex sampling procedures.

If $\underline{a}$ is a uniform random draw from $R_q$ this is denoted $\underline{a} \sim R_q$, or correspondingly if
$\underline{a}$ is a draw from the multivariate bounded discrete Gaussian distribution induced on $R$, $\chi$, this
is denoted $\underline{a} \sim \chi$.
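The two kinds of random draw just described can be sketched as follows. This is a simple illustrative sampler that enumerates the support and weights explicitly; the function names, the default bound of $10\sigma$ and the use of Python's `random` module (which is not cryptographically secure) are assumptions for the example.

```python
import math
import random


def sample_chi(n_coeffs, sigma, rng, bound=None):
    """Draw n_coeffs independent bounded discrete Gaussian coefficients.

    The mass at integer x in [-B, B] is proportional to exp(-x^2 / (2 sigma^2)).
    """
    B = int(10 * sigma) if bound is None else bound
    support = list(range(-B, B + 1))
    weights = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in support]
    return rng.choices(support, weights=weights, k=n_coeffs)


def sample_uniform(n_coeffs, q, rng):
    """Draw n_coeffs coefficients uniformly from Z_q = (-q/2, q/2]."""
    low, high = -((q - 1) // 2), q // 2
    return [rng.randrange(low, high + 1) for _ in range(n_coeffs)]


rng = random.Random(2015)
noise_poly = sample_chi(8, sigma=2.0, rng=rng)       # a draw from chi
uniform_poly = sample_uniform(8, q=2**20, rng=rng)   # a uniform draw from R_q
```

Because each coefficient is drawn independently, sampling a whole polynomial is just repeated scalar sampling.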


2.3.2 The encryption scheme 

The message space of this scheme is the polynomial ring $M = R_t$. Thus any integer
message $m$ must be converted to a polynomial representation $m(x)$. In principle, if $m$ is
small enough that $m \in \mathbb{Z}_t$ then the degree zero polynomial $m(x) = m \in R_t$ is sufficient.
However, there are reasons which will become apparent that this is undesirable even when
$m$ is small enough (or $t$ is large enough).

^1 In simple terms, $\Phi_d(x)$, the $d$-th cyclotomic polynomial, is the polynomial which: divides $x^d - 1$; does
not divide $x^n - 1$ for any $n < d$; has integer coefficients; and cannot be factorised.
For example, $\Phi_3(x) = x^2 + x + 1$ because $(x^2 + x + 1)(x - 1) = x^3 - 1$, but it does not divide $x^2 - 1$
or $x - 1$, it has integer coefficients and it cannot be factorised.

A better approach is to take an integer to be encrypted, write it in standard $b$-bit
binary representation, $m = \sum_{n=0}^{b-1} a_n 2^n$, and then simply construct

$$ m(x) = \sum_{n=0}^{2^{d-1}-1} a_n x^n \in R_t $$

where $a_n = 0 \ \forall\, n \ge b$. Recovery of the original message after decryption is then
simply evaluation of $m(2) = m$, because homomorphic addition and multiplication
operations will correspond to operations on the polynomials preserving the end result. This
representation is assumed here and is used automatically in the software contribution of
Section 4.
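The binary polynomial representation can be illustrated on unencrypted polynomials. The sketch below is an assumption-laden toy: it handles only non-negative integers, uses hypothetical helper names, and ignores the reductions modulo $t$ and modulo the cyclotomic polynomial (which is harmless while degrees and coefficients stay small). It shows that adding or multiplying the polynomials and then evaluating at 2 gives the sum or product of the original integers.

```python
def encode(m, n_coeffs):
    """Binary polynomial encoding of a non-negative integer (lowest degree first)."""
    if m < 0 or m >> n_coeffs:
        raise ValueError("m must satisfy 0 <= m < 2**n_coeffs")
    return [(m >> n) & 1 for n in range(n_coeffs)]


def decode(coeffs):
    """Recover the integer by evaluating the polynomial at x = 2."""
    return sum(a * 2**n for n, a in enumerate(coeffs))


def poly_add(a, b):
    """Coefficient-wise addition of two equal-length polynomials."""
    return [x + y for x, y in zip(a, b)]


def poly_mul(a, b):
    """Plain polynomial product (no modular reduction in this sketch)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


five, six = encode(5, 8), encode(6, 8)
sum_value = decode(poly_add(five, six))       # evaluates to 5 + 6
product_value = decode(poly_mul(five, six))   # evaluates to 5 * 6
```

After an operation the coefficients are no longer all binary, which is fine: evaluation at 2 is still a ring homomorphism, so the end result is preserved.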

The cipher text space is the Cartesian product of two polynomial rings $C = R_q \times R_q$,
where $q \gg t$. As will be seen, the message polynomial is essentially embedded in the
$\log_2(t)$ most significant bits of the first polynomial in $C$, with the random noise growing
from the least significant bits. Once the noise grows under repeated operations and
reaches the $\log_2(t)$ most significant bits the message is lost.

The parameters of the scheme are: $d$, determining the degree of both the polynomial
rings $M$ and $C$; $t$ and $q$, determining the coefficient sets of the polynomial rings $M$ and
$C$; and $\sigma$, determining the magnitude of the randomness used for semantic security.

An example of values which ensure good security would be $d = 13$ (4095 degree
polynomials), $q = 2^{128}$, $t = 2^{15}$, $\sigma = 16$ (Fan and Vercauteren, 2012). The software
contribution of Section 4 provides functions to help select these parameters automatically
based on lower bounds of security and computability they provide.

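Key Generation: The secret key, $\underline{k_s}$, is simply a uniform random draw from $R_2$ (i.e.
sample a $2^{d-1}$ binary vector for the polynomial coefficients).

The public key, $k_p$, is a vector containing two polynomials:

$$ k_p = (\underline{k_{p1}}, \underline{k_{p2}}) := ([-(\underline{a} \cdot \underline{k_s} + \underline{e})]_q, \underline{a}) \in R_q \times R_q $$

where $\underline{a}$ is a uniform random draw from $R_q$ and $\underline{e} \sim \chi$. Note $\underline{k_s}$ is hard to extract from $k_p$
precisely due to ring LWE hardness results (see Appendix B).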

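Encryption, $\mathrm{Enc}(k_p, m)$: An integer message $m$ is first represented as $\underline{m} \in R_t$ as
described above. Encryption then renders a cipher text which is a vector containing two
polynomials:

$$ c = (\underline{c_1}, \underline{c_2}) := ([\underline{k_{p1}} \cdot \underline{u} + \underline{e_1} + \Delta \cdot \underline{m}]_q, [\underline{k_{p2}} \cdot \underline{u} + \underline{e_2}]_q) \in R_q \times R_q $$

where $\underline{u}, \underline{e_1}, \underline{e_2} \sim \chi$ and $\Delta = \lfloor q/t \rfloor$.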

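Decryption, $\mathrm{Dec}(k_s, c)$: Decryption of a cipher text $c$ is by evaluating:

$$ \underline{m} = \left[ \left\lfloor \frac{t}{q} [\underline{c_1} + \underline{c_2} \cdot \underline{k_s}]_q \right\rceil \right]_t \in R_t $$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer, so that $m = m(2)$.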

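Addition, +: Addition in message space is achieved in cipher text space by standard
vector and polynomial addition with modulo reduction:

$$ c_1 + c_2 = ([\underline{c_{11}} + \underline{c_{21}}]_q, [\underline{c_{12}} + \underline{c_{22}}]_q) $$

where $\underline{c_{ij}}$ denotes the $j$-th polynomial of cipher text $c_i$. It is an easy and enlightening
exercise to verify by hand that $\mathrm{Dec}(k_s, c_1 + c_2)$ renders $m_1 + m_2$.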
Multiplication, $\times$: Multiplication in message space produces a more complex operation
in cipher text space which increases the length of the cipher text vector:

$$ c_1 \times c_2 = \left( \left[ \left\lfloor \frac{t}{q} (\underline{c_{11}} \cdot \underline{c_{21}}) \right\rceil \right]_q, \left[ \left\lfloor \frac{t}{q} (\underline{c_{11}} \cdot \underline{c_{22}} + \underline{c_{12}} \cdot \underline{c_{21}}) \right\rceil \right]_q, \left[ \left\lfloor \frac{t}{q} (\underline{c_{12}} \cdot \underline{c_{22}}) \right\rceil \right]_q \right) $$

Although it is still possible to recover $m$ from one of these larger cipher texts by
modifying the decryption function to be

$$ \left[ \left\lfloor \frac{t}{q} [\underline{c_1} + \underline{c_2} \cdot \underline{k_s} + \underline{c_3} \cdot \underline{k_s} \cdot \underline{k_s}]_q \right\rceil \right]_t $$

it is preferable to perform a ‘relinearisation’ procedure which compacts the cipher text to a
vector of two polynomials again and reverts to the original decryption procedure. Thus in
practice multiplication is a two step procedure: cipher text multiplication followed by
relinearisation. Description of relinearisation is beyond the scope of this review, but full
details are in Fan and Vercauteren (2012) and it is seamlessly implemented in the software
contribution described in Section 4.
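The key generation, encryption, decryption, addition and multiplication steps above can be pulled together in a toy sketch. Everything here is an illustrative assumption rather than the accompanying R package: the parameters are tiny and insecure, the helper names are hypothetical, Python's `random` module is not a cryptographic source of randomness, and multiplication is left unrelinearised so that decryption uses the extended three-polynomial form.

```python
import math
import random

N, Q, T, SIGMA = 8, 2**60, 2**5, 2.0   # toy values: degree bound, q, t, sigma
DELTA = Q // T
rng = random.Random(1)

def cmod(a, q):                          # centred reduction [a]_q
    r = a % q
    return r - q if r > q // 2 else r

def pmod(p, q):
    return [cmod(c, q) for c in p]

def padd(a, b):
    return [x + y for x, y in zip(a, b)]

def pmul(a, b):                          # product in Z[x] / (x^N + 1)
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                out[i + j] += x * y
            else:
                out[i + j - N] -= x * y
    return out

def chi():                               # bounded discrete Gaussian polynomial
    B = int(10 * SIGMA)
    xs = list(range(-B, B + 1))
    ws = [math.exp(-x * x / (2 * SIGMA**2)) for x in xs]
    return rng.choices(xs, weights=ws, k=N)

def scale_round(p):                      # round(t * c / q), exact integer arithmetic
    return [(2 * T * c + Q) // (2 * Q) for c in p]

def keygen():
    ks = [rng.randrange(2) for _ in range(N)]
    a = [rng.randrange(-(Q // 2) + 1, Q // 2 + 1) for _ in range(N)]
    kp1 = pmod([-(x + y) for x, y in zip(pmul(a, ks), chi())], Q)
    return ks, (kp1, a)

def encrypt(kp, m):
    mpoly = [(m >> n) & 1 for n in range(N)]        # binary encoding, 0 <= m < 2^N
    u, e1, e2 = chi(), chi(), chi()
    c1 = pmod([x + y + DELTA * z for x, y, z in zip(pmul(kp[0], u), e1, mpoly)], Q)
    c2 = pmod(padd(pmul(kp[1], u), e2), Q)
    return [c1, c2]

def decrypt(ks, c):
    acc, power = [0] * N, [1] + [0] * (N - 1)
    for ci in c:                         # c1 + c2*ks (+ c3*ks*ks after a multiply)
        acc = padd(acc, pmul(ci, power))
        power = pmul(power, ks)
    mpoly = pmod(scale_round(pmod(acc, Q)), T)
    return sum(coef * 2**n for n, coef in enumerate(mpoly))   # evaluate at 2

def add(c, d):
    return [pmod(padd(x, y), Q) for x, y in zip(c, d)]

def mul(c, d):                           # yields three polynomials (no relinearisation)
    parts = [pmul(c[0], d[0]),
             padd(pmul(c[0], d[1]), pmul(c[1], d[0])),
             pmul(c[1], d[1])]
    return [pmod(scale_round(p), Q) for p in parts]

ks, kp = keygen()
sum_plain = decrypt(ks, add(encrypt(kp, 5), encrypt(kp, 6)))
prod_plain = decrypt(ks, mul(encrypt(kp, 3), encrypt(kp, 5)))
```

With these toy parameters the noise stays far below $\Delta / 2$ for a single addition or multiplication, so `sum_plain` and `prod_plain` recover the sum and product of the encrypted integers.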


2.3.3 A practical note 

Above, a binary polynomial representation of integers was proposed as being preferable
to a scalar (zero degree polynomial) representation (i.e. a natural number), even when
the message is small enough that $m \in \mathbb{Z}_t$, the reason for which should now be clearer.

Consider the addition operation with the example parameters given above, recall that
each coefficient of $m(x)$ must be in the range −16,383 to 16,384 after computation in
order to decrypt correctly, and note that the addition operation results in direct addition
of coefficients in the polynomial representations. Now, bearing these points in mind, if
$m(x) = m$ then addition will only render the correct answer so long as the overall final
result also remains in the range −16,383 to 16,384. However, with a binary representation
the largest coefficient of any term in $m(x)$ will be ±1, so that at least 16,384 additions
(possibly more) can be performed and still be guaranteed to decrypt correctly, furthermore
allowing the final result to be much larger than ±16,384. Not only does this permit more
additions, but more importantly the binary representation allows a general hard bound
for how many additions can be performed while still guaranteeing the correct value is
decrypted, without knowledge of the messages.
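The difference between the two representations can be seen in a small plain-text model of the coefficient arithmetic. This sketch is illustrative only: it ignores encryption and noise entirely, assumes $t = 2^{15}$ as in the example parameters, and simply applies the centred reduction to each coefficient after summing many copies of the same message.

```python
T = 2**15  # plain text coefficient modulus from the example parameters


def centred_mod(a, q):
    """Representative of a mod q in (-q/2, q/2]."""
    r = a % q
    return r - q if r > q // 2 else r


def sum_of_copies(coeffs, copies):
    """Coefficients after homomorphically summing `copies` copies of a message."""
    return [centred_mod(c * copies, T) for c in coeffs]


def decode(coeffs):
    """Evaluate the message polynomial at x = 2."""
    return sum(c * 2**n for n, c in enumerate(coeffs))


scalar_three = [3]      # m(x) = 3, the degree zero representation
binary_three = [1, 1]   # m(x) = 1 + x, so that m(2) = 3

scalar_result = decode(sum_of_copies(scalar_three, 10_000))  # coefficient wraps
binary_result = decode(sum_of_copies(binary_three, 10_000))  # stays in range
```

With the scalar representation the single coefficient reaches 30,000 and wraps around, whereas each binary coefficient only reaches 10,000, so the polynomial still evaluates to the correct total of 30,000.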


2.4 Some limitations 


At this juncture it is important to temper any building excitement. Although Gentry
(2009) theoretically provided an exemplar for how fully homomorphic schemes could
be constructed, the extraordinary theoretical possibilities are constrained by practical
limitations. These crucial limitations mean that it is not simply a matter of taking any
algorithm and converting it to run on encrypted data, so that many statistical algorithms
are in fact beyond the computational reach of existing homomorphic schemes.

The limitations discussed now are in general common to all current homomorphic 
schemes to a varying degree, though specific homomorphic encryption algorithms may 
have their own additional constraints. In each case, the limitation will be highlighted in 
the context of the scheme described in Section 2.3.

2.4.1 Message space


There are currently no schemes which will directly encrypt arbitrary values in $\mathbb{R}$.
Indeed, the most common message space is simply binary, $M = \{0,1\}$, with this being of
particular appeal to theoretical cryptographers because it corresponds to construction of
arbitrary Boolean circuits and allows all the results in computational complexity theory
to be applied to determine computability. From a practical standpoint, though, this is
not presently a very feasible avenue.

However, there are schemes which have an expanded message space, such as $M =
\mathbb{Z}/n\mathbb{Z}$, or $M = \{-n, -n+1, \ldots, n-1, n\}$ for some integer $n$. These schemes generally
correspond to integer rings or fields (for prime $n$) where ordinary rules of arithmetic can
be assumed when results are bounded by $n$. In many schemes which support expanded
message spaces, increasing $n$ will impact the capabilities of the scheme (decreasing
security, computation speed, computational depth or all these).

A method which can be used to increase the size of the message space is via the
Chinese Remainder Theorem as a means of representing a large integer.

Chinese Remainder Theorem (Knuth, 1997, p.270) Let $m_1, \ldots, m_k \in \mathbb{Z}^+$ be pairwise
coprime positive integers. Let $M = \prod_{i=1}^{k} m_i$ and let $a, x_1, \ldots, x_k \in \mathbb{Z}$. Then there is
exactly one integer $x$ that satisfies the conditions:

$$ a \le x < a + M \quad \text{and} \quad x \equiv x_i \bmod m_i \quad \forall\, 1 \le i \le k $$


Thus, an integer message $x \in [a, a+M)$ can be uniquely represented by the collection
of smaller integers $\{x_i\}_{i=1}^{k}$, called the residues. More formally, $\mathbb{Z}/M \cong \mathbb{Z}/m_1 \times \cdots \times \mathbb{Z}/m_k$.
So, if each $m_i$ is chosen small enough that the scheme can encrypt it, then much larger
message spaces can be achieved by encrypting the collection of residues. The process
is reversible so that the value $x$ can be recovered given $\{x_i\}_{i=1}^{k}$ (Knuth, 1997, p.274).
Such a representation is called a residue number system (Garner, 1959) and has the
additional advantage that addition and multiplication operations (the only ones which
can be performed homomorphically anyway) are embarrassingly parallel: performing the
same operation according to the modular arithmetic of each residue will result in a residue
representation of the corresponding result of operating on the large integers.
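A residue number system can be sketched on plain integers as follows. The moduli, function names and the choice $a = 0$ (so that values lie in $[0, M)$) are illustrative assumptions; in an encrypted setting each residue would be a separate cipher text and each channel could be processed in parallel.

```python
from math import prod

# Pairwise coprime moduli: 251 is prime, 253 = 11 * 23, 255 = 3 * 5 * 17, 256 = 2^8.
MODULI = (251, 253, 255, 256)
BIG_M = prod(MODULI)


def to_residues(x):
    """Represent x in [0, BIG_M) by its residues modulo each m_i."""
    return tuple(x % m for m in MODULI)


def from_residues(residues):
    """Invert the representation using the Chinese Remainder Theorem."""
    x = 0
    for r, m in zip(residues, MODULI):
        partial = BIG_M // m
        x += r * partial * pow(partial, -1, m)   # pow(..., -1, m) is a modular inverse
    return x % BIG_M


def rns_add(a, b):
    """Channel-wise addition: each residue is updated independently."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Channel-wise multiplication: each residue is updated independently."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


x, y = 123456, 7890
recovered_sum = from_residues(rns_add(to_residues(x), to_residues(y)))
recovered_product = from_residues(rns_mul(to_residues(x), to_residues(y)))
```

The results are correct as long as the true sum or product is below the product of the moduli, even though no single channel ever holds a value larger than 255.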

Related and more common in the homomorphic encryption literature is the reverse
usage of the polynomial version of the Chinese Remainder Theorem, which enables
combining multiple messages into a single polynomial representation (that is, $\underline{m}$ now holds
multiple plain text messages before encryption), so that operations on the single cipher
text perform simultaneous operations on all the messages in a manner
akin to Single Instruction Multiple Data (SIMD) instructions on a CPU (Smart and
Vercauteren, 2014). This of course reduces rather than increases the possible range of
individual messages which can be encrypted.

Even if using the Chinese remainder theorem to represent larger values, the issue
remains of how to handle statistical data, which is commonly not binary or integer. There
are at least two approaches: the first is common throughout the literature, whereby any
real value is approximated by some rational number, with numerator and denominator
encrypted separately and propagated through using the usual rules of arithmetic for
fractions. The second is a logarithmic representation developed by Franz et al. (2010), in
which division is possible but where addition and subtraction become substantially more
complex to implement.
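The rational approach can be sketched on plain integers. The pair representation and function names below are illustrative assumptions; the point is that the update rules use only integer addition and multiplication, the two operations available homomorphically. No reduction to lowest terms is attempted, since that would need division and comparison, so numerators and denominators grow with every operation.

```python
from fractions import Fraction


def to_pair(value, max_denominator=1000):
    """Approximate a real value by a (numerator, denominator) integer pair."""
    frac = Fraction(value).limit_denominator(max_denominator)
    return frac.numerator, frac.denominator


def frac_add(a, b):
    """(n1/d1) + (n2/d2) using only integer additions and multiplications."""
    (n1, d1), (n2, d2) = a, b
    return n1 * d2 + n2 * d1, d1 * d2


def frac_mul(a, b):
    """(n1/d1) * (n2/d2) using only integer multiplications."""
    (n1, d1), (n2, d2) = a, b
    return n1 * n2, d1 * d2


third, sixth = (1, 3), (1, 6)
unreduced_sum = frac_add(third, sixth)       # numerator and denominator both grow
unreduced_product = frac_mul(third, sixth)
```

Only after decryption would the numerator be divided by the denominator to obtain the final real-valued result.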

The FandV scheme has an unusual message space, being a polynomial ring. For the
example parameter values given above, this means that when using the binary
representation of integer values, the integers can in principle be very large (over $\pm 10^{1233}$). As
such, the limitation in message space size may seem less acute than in other homomorphic
schemes (especially binary ones), but the practical issue raised in Section 2.3.3 means that it
may still be advantageous to use a residue number system representation if there will be
a lot of addition.


In the follow on to this review (Aslett, Esperança and Holmes, 2015), two other
approaches are proposed: one where data is effectively quantile binned in a binary indicator
fashion, which is shown to effectively enable simple comparison operations; and another
discretisation of real values which is appropriate for linear modelling.


2.4.2 Cipher text size 

Once the value to be encrypted has been appropriately represented such that only
elements of $M$ need to be encrypted, there is the additional issue of a substantial inflation
in the size of the message after encryption, often by several orders of magnitude.

As a concrete example, the usual representation of an integer in a computer requires
4 bytes of memory. If such a message is encrypted under the scheme presented in Section
2.3 then using the example parameters will result in cipher texts occupying 65,536 bytes
(4096 coefficients, each a 128-bit integer). Consequently, a 1MB data set will occupy
nearly 16.4GB encrypted.
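The arithmetic behind these figures can be reproduced directly. The sketch below simply follows the counting used in the text (4096 coefficients of 128 bits per encrypted value, decimal megabytes and gigabytes); the function name and defaults are illustrative assumptions.

```python
def encrypted_size_bytes(plain_bytes, d=13, coeff_bits=128, bytes_per_value=4):
    """Size after encryption, counting 2^(d-1) coefficients per encrypted value."""
    coefficients = 2 ** (d - 1)                       # 4096 when d = 13
    cipher_bytes = coefficients * coeff_bits // 8     # bytes per encrypted value
    n_values = plain_bytes // bytes_per_value         # 4-byte integers in the data
    return cipher_bytes, n_values * cipher_bytes


per_value, total = encrypted_size_bytes(1_000_000)    # a 1MB data set
inflation_factor = per_value // 4
total_gb = total / 1e9
```

Under this counting each 4-byte integer inflates by a factor of 16,384, which is where the figure of roughly 16.4GB for a 1MB data set comes from.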


One mitigating proposal (Naehrig et al., 2011) is to initially encrypt values using a
non-homomorphic, size efficient encryption algorithm such as AES, and to encrypt the
AES decryption key with a homomorphic scheme. The decryption circuit for AES can
then be executed homomorphically, rendering a homomorphic encryption of the original
message. This would mean that communication and long term storage of encrypted values
could be space efficient, with expanded homomorphic cipher texts generated by effectively
‘recrypting’ from this compact format when computation is required. AES is an industry
standard, but required 36 hours to execute homomorphically (Gentry et al., 2012) (for
56 AES blocks, corresponding to 896 bytes of data), although a more recent lightweight
cipher named SIMON can be recrypted homomorphically in around 12 minutes (Lepoint
and Naehrig, 2014). However, these approaches operated on binary messages, so the
resulting recryption is to a binary scheme with the attendant issues already discussed.


2.4.3 Computational cost 

Elements of cipher text space are not only larger in memory (with an associated
additional computational cost to process), but will typically also be more complex spaces.
For example, in Section 2.3 the cipher text space is the ring of polynomials modulo a
cyclotomic polynomial, with coefficients from a large integer ring (e.g. 128-bit integers).
Consequently, arithmetic operations are substantially more costly than standard
arithmetic: there is large polynomial arithmetic involving coefficients which are too large to fit
in standard 32-bit or 64-bit integers, with the additional overhead of modulo operations
on both the coefficients and polynomial.

Most current schemes can achieve reasonable speeds for additions, but are very
constrained in speed of multiplications. The optimised scheme implemented in the R package
HomomorphicEncryption (Aslett, 2014) achieves thousands of additions per second, and
about 50 multiplications per second. This is mitigated as far as possible by transparently
implementing full CPU parallelism.
If all the operations involved can be performed in a single instruction multiple data
(SIMD) fashion then the polynomial Chinese remainder theorem alluded to above can be
used when representing the messages as a polynomial prior to encryption. In this way a
single cipher text operation actually operates in a SIMD manner on many messages for
the same computational cost (Smart and Vercauteren, 2014). Naturally, there is a limit
to how many messages can be packed into a single cipher text in this way.


2.4.4 Division and comparison operators 

At present there are no homomorphic schemes capable of natively supporting division
operations, only addition and multiplication. An additional serious constraint is the
inability to have any conditional code flow: comparison operators such as tests of equality
and inequality cannot be performed on the encrypted data. Consequently, many
algorithms appear out of reach without substantial redevelopment.


2.4.5 Depth of operations 


The final limitation relates to the number of operations which can be applied. As
explained in the discussion on semantic security, there is randomness injected into the
cipher text in these encryption schemes. When operations are performed, the noise tends
to accumulate (exactly how being scheme dependent): for example, in many schemes
multiplication operations result in direct multiplication of the noise components, leading
in the naive case to potentially exponential increases in the magnitude of the noise over
many operations. Once the noise exceeds a certain threshold, decryption will return
an incorrect message.
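
As a rough illustration of the naive case, the sketch below tracks a single noise
magnitude under a toy model in which addition adds the noise of its operands and
multiplication multiplies it. The model, the fresh noise level and the threshold are all
assumptions made for illustration and do not correspond to any particular scheme.

```python
def noise_after_squarings(fresh_noise, levels):
    """Toy bound: each multiplication multiplies the noise of its operands."""
    noise = fresh_noise
    for _ in range(levels):
        noise = noise * noise  # multiply a cipher text by itself
    return noise


def noise_after_additions(fresh_noise, count):
    """Toy bound: adding `count` fresh cipher texts adds their noise."""
    return fresh_noise * count


FRESH = 8            # hypothetical noise magnitude of a fresh cipher text
THRESHOLD = 2 ** 60  # hypothetical level beyond which decryption fails

for levels in range(6):
    noise = noise_after_squarings(FRESH, levels)
    print(levels, noise, noise < THRESHOLD)
```

Under these assumptions a thousand additions raise the noise only from 8 to 8000,
whereas five successive squarings already push it past the threshold.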

It is important to be clear that it is not usually the total number of multiplications
which is limited, but rather the depth (i.e. the maximum degree of the evaluated
polynomial). For example, $x_1 \times x_2 \times x_3$ has multiplicative depth 2, whereas
$x_1 \times x_2 + x_3 \times x_4 + \dots + x_{n-1} \times x_n$ has multiplicative depth 1
$\forall\, n$. Exactly what depth
a scheme can achieve will depend on the scheme itself and usually on the parameters
chosen, which commonly involves a tradeoff of speed, security or memory requirements
against depth of operations.
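
The distinction between the number of multiplications and their depth can be checked
mechanically. The sketch below assumes expressions are written as nested tuples, a
representation chosen purely for illustration, and counts only multiplications that feed
into one another.

```python
def depth(expr):
    """Multiplicative depth of a nested tuple expression.

    An expression is a variable name (str) or a tuple (op, left, right)
    with op equal to "+" or "*".
    """
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    d = max(depth(left), depth(right))
    return d + 1 if op == "*" else d


# x1 * x2 * x3: two multiplications in sequence
chain = ("*", ("*", "x1", "x2"), "x3")

# x1*x2 + x3*x4 + x5*x6: three multiplications, none feeding into another
flat = ("+", ("+", ("*", "x1", "x2"), ("*", "x3", "x4")), ("*", "x5", "x6"))

print(depth(chain))  # 2
print(depth(flat))   # 1
```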


In principle, one of the breakthrough aspects of Gentry’s (2009) work was the ability
to bootstrap (entirely unrelated to the statistics term) a cipher text: an operation which
resets the noise to that of a freshly encrypted message. However, most bootstrapping
routines are very complex to implement, extremely slow to execute, or both. As a result,
it is almost universal in the applied cryptography literature to set the parameters of the
scheme under consideration to be such that the necessary depth of operations can be
performed without a bootstrapping step being required. The software contribution of
Section 4 provides functions to help automatically select the parameters based on lower
bounds in the literature for the depth of multiplications required.


2.4.6 Motivation 

To date the small number of applied cryptography papers have largely taken existing
statistical techniques which can be made to directly fit within these constraints and
demonstrated any minor refactoring of the algorithms that is necessary, but leave them
fundamentally unaltered (some examples are reviewed in Section 3). However, statisticians
and machine learners are well placed to develop principled approximations to
current statistical and machine learning techniques, or entirely new techniques, where
the constraints of homomorphic encryption are considered at all stages of model and
algorithm development, and where uncertainties and errors introduced can be studied.
Some initial contributions in this direction are presented in Aslett, Esperança and Holmes
(2015).


2.5 Usage scenarios 

The most obvious usage scenario is to outsource long-term storage and computation 
of sensitive data to a third party cloud provider. Here the ‘client’ (the owner of the 
data) encrypts everything prior to uploading to the ‘server’ (at the cloud provider’s data 
centre). Due to some of the limitations discussed above, this scenario is perhaps currently 
only suitable in a restricted set of situations where the added computational costs and 
inflated data size are not prohibitive. As homomorphic schemes improve, the boundary
at which this becomes a practical usage scenario will shift.

However, with the explosion of extremely compute, memory and battery constrained 
devices such as smart watches and glasses it may be that scenarios where additional server 
side memory and compute costs are a worthwhile trade-off are substantially broader. This 
is especially true given the biomedical focus of many of these recent devices which collect 
a lot of sensitive health data: collection of this on constrained client devices and handoff 
to a cryptographically secure server storage area which is capable of encrypted statistical 
analysis is an attractive proposition for both users and manufacturers. 

An additional scenario is one in which it is desirable to be able to perform statistical 
analyses without the data being visible to anyone at all. To be concrete, consider a 
research institute requiring patient data for analysis: the research institute could widely 
distribute their public key to enable patients to securely donate their sensitive personal 
data. This data would be encrypted and sent directly to the cloud provider who would 
have a contractual obligation to only allow the research institute access to the results 
of pre-approved functions run on that data, not to the raw encrypted data itself. Peer 
review would be important for pre-approving certain functions to be homomorphically 
executed to ensure that the original data is not indirectly leaked. An interesting effect 
here may be increased statistical power (despite homomorphic approximations) due to 
the greater sample sizes which could result from increased participation because of the 
privacy guarantees. 

There is at least one further usage scenario: that is, where there is confidential data on 
which a confidential algorithm must be run. In this situation, a client may encrypt their 
data to give to the developer of the algorithm and receive the results of the algorithm 
without either party compromising data or algorithm. In this situation, the constraints of 
homomorphic encryption are merely an opportunity cost because there may be no other 
way to achieve the same goal. 


3 Current Methods 

There are two aspects which, from the perspective of a statistician, are important to
review: prior work on encrypted statistics algorithms and existing software implementations
for making use of homomorphic encryption schemes.

In this section, both aspects are surveyed before the software tools documented in
this paper are covered in Section 4.


3.1 Encrypted statistics


In recent years, some work has emerged on statistical methods for homomorphically
encrypted data.


Graepel et al. (2012) proposed algorithms for binary classification, namely secure
versions of the Linear Means and Fisher’s Linear Discriminant classifiers. The algorithms
are rewritten in such a way that divisions are avoided but the original score function
(needed for classification) is computable up to a constant. Because some operations have
no counterpart in the encryption framework (like division and comparison), some of the
computation is done offline by the client after decrypting results returned by the cloud.
For instance, in binary classification $y \in \{-1, 1\}$ with Linear Means, the class label
is computed in this way as the sign of a score function. To represent real numbers as
integers, the authors propose a rescaling approach which approximates real numbers with
rational numbers (integer numerator and denominator) and then clears denominators by
multiplying all numbers by an appropriate factor and rounding the result to the nearest
integer. Approximation accuracy can be controlled in this way.
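
The rescaling idea can be sketched in a few lines. This is a simplified illustration
using one fixed common factor, not the exact procedure of Graepel et al.; the names
encode, decode and SCALE are assumptions introduced here. The point to note is that
the scale factor must be tracked, since products carry its square.

```python
from fractions import Fraction

SCALE = 1000  # hypothetical common factor; larger values give finer precision


def encode(x, scale=SCALE):
    """Map a real number to the nearest integer multiple of 1/scale."""
    return round(Fraction(x) * scale)


def decode(n, scale):
    """Undo the rescaling (done by the client after decryption)."""
    return n / scale


a, b = encode(1.25), encode(-0.5)      # 1250 and -500

# Integer-only arithmetic, as would be carried out on cipher texts.
total = a + b                          # still scaled by SCALE
product = a * b                        # scaled by SCALE ** 2

print(decode(total, SCALE))            # 0.75
print(decode(product, SCALE ** 2))     # -0.625
```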


Wu and Haven (2012) extended previous work on encrypted statistics (Lauter et al.,
2011), namely the computation of mean and covariance in a multivariate scenario, using
the same technique of returning separate encrypted numerators and denominators.
Additionally, they also mention the possibility of implementing (and indeed implement)
low-dimension linear regression ($d < 5$) by using Cramer’s rule to invert the matrix
$X^T X$ which is required to obtain the ordinary least squares estimates of the regression
parameters $\hat{\beta} = (X^T X)^{-1} X^T Y$. Because Cramer’s rule also involves a division by the
determinant of $X^T X$, the computation cannot be completely performed homomorphically
and must be finished offline by the client who assembles the division factors
post-decryption. Apart from the computational issues caused by division, there are additional
problems here, the most important being the complexity of Cramer’s rule: for a problem
with dimension $d$, the computation of the determinant has multiplicative depth $d - 1$
and requires $O(d!)$ multiplications. Allied to this comes the computation of the adjoint
matrix, having similarly substantial computational complexity. The restriction is twofold:
firstly, in the multiplicative depth of operations; and secondly, in the computational
costs of these operations. Whereas the second restriction implies possible intractability
of high-dimensional linear regression, the first restriction affects correctness of decryption
and so should be regarded as more serious.
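
A minimal plain-integer sketch of this deferred-division strategy follows. It is not Wu
and Haven's implementation: no encryption is involved, the function names are
illustrative assumptions, and cofactor expansion is used simply because it needs only
additions and multiplications.

```python
from fractions import Fraction


def minor(m, i, j):
    """Matrix m with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def det(m):
    """Determinant by cofactor expansion: additions and multiplications only."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def adjugate(m):
    """Transpose of the cofactor matrix, again without any division."""
    n = len(m)
    if n == 1:
        return [[1]]
    return [[(-1) ** (i + j) * det(minor(m, j, i)) for j in range(n)]
            for i in range(n)]


def ols_numerators_and_denominator(X, y):
    """Return (adj(X'X) X'y, det(X'X)) using integer arithmetic only."""
    n, d = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(d)]
           for i in range(d)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(d)]
    adj = adjugate(xtx)
    numerators = [sum(adj[i][j] * xty[j] for j in range(d)) for i in range(d)]
    return numerators, det(xtx)


# Integer design matrix (intercept and one covariate) with y = 1 + 2x exactly.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]

numerators, denominator = ols_numerators_and_denominator(X, y)

# The single division is left to the client, after decryption.
beta = [Fraction(num, denominator) for num in numerators]
print(numerators, denominator, beta)
```

The recursive expansion makes the factorial growth in the number of multiplications
visible: each determinant of size $d$ spawns $d$ determinants of size $d - 1$.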


Lauter et al. (2014) observed that it is possible to analyse genomic data in a
privacy-preserving framework and provide some examples of algorithms in statistical genetics
which are implementable under the restrictions of homomorphic encryption, including the
Cochran-Armitage trend test, the expectation-maximisation algorithm and measures of
goodness-of-fit and linkage disequilibrium. The main issue in implementing these methods
under the homomorphic encryption framework is that divisions are not possible. The
solution proposed is to write the statistics in terms of the two factors involved in a
division (dividend/numerator and divisor/denominator), compute these homomorphically
and send them back to the client, who decrypts each factor and performs the division
offline. For complex problems where divisions cannot be grouped (by combining dividends
and divisors), there will be a higher number of cipher texts being passed to the client,
which increases communication costs and, more importantly, may compromise privacy
since more information is contained in less processed cipher texts.

Another class of privacy-preserving statistical methods has been proposed for predictive
purposes: an algorithm is trained offline (say, a regression model) and the corresponding
predictive model (the parameters in the regression model, $\beta$) encrypted. For
prediction tasks, covariates are encrypted and sent to the server, where computations
take place (e.g., the computation of the regression model predictor, $X_{i\cdot}\beta$) and are then
returned to the client for decryption (and potentially further transformation, as would
be the case for generalised linear models). Examples of these include logistic regression
(Bos et al., 2013) and hidden Markov models (Pathak et al., 2011).

Crucially, in all these current methods, existing algorithms are simply refactored to 
run homomorphically rather than developing novel approaches to approximate otherwise 
currently intractable statistical techniques. 


3.2 Implementations 

As will be clear from Section 2.3, many homomorphic schemes can be non-trivial to
implement. Some public implementations are releases of software which was written for
a specific paper, whilst there are a small number of libraries or packages enabling reuse.
Most libraries or packages commonly provide interfaces in low-level languages such as
C/C++. A very compact single C file library implementing Gentry (2010) is ‘libfhe’
(Minar, 2010). This implementation is based on a binary scheme, but has routines to
allow encryption of integers by base-2 decomposing, encrypting each binary digit separately
and then implementing binary adder arithmetic (so that even addition will involve cipher
text multiplications). There is no bootstrapping implementation and at time of writing
there have been no apparent updates since 2010.
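
To see why integer addition requires multiplications in a binary scheme, the sketch
below adds two integers using only addition and multiplication modulo 2 on individual
bits. Plain bits stand in for encrypted ones and all function names are illustrative
assumptions; this is not code from libfhe.

```python
def xor(a, b):
    """Addition modulo 2, the homomorphic 'addition' of a binary scheme."""
    return (a + b) % 2


def and_(a, b):
    """Multiplication modulo 2, the homomorphic 'multiplication'."""
    return (a * b) % 2


def to_bits(n, width):
    """Base-2 decomposition, least significant bit first."""
    return [(n >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Reassemble an integer from its bits (least significant first)."""
    return sum(bit << i for i, bit in enumerate(bits))


def add_bits(x, y):
    """Ripple-carry addition using only xor and and_ on individual bits."""
    carry, out = 0, []
    for a, b in zip(x, y):
        partial = xor(a, b)
        out.append(xor(partial, carry))
        # carry out = (a AND b) XOR (carry AND (a XOR b))
        carry = xor(and_(a, b), and_(carry, partial))
    return out + [carry]


total = add_bits(to_bits(42, 8), to_bits(34, 8))
print(from_bits(total))  # 76
```

Because each carry depends on the previous one through a multiplication, the
multiplicative depth of this adder grows with the bit width.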


‘Scarab’ (Perl et al., 2011) is another low-level C library, implementing instead another
integer cipher text space scheme by Smart and Vercauteren (2010). This implementation
allows only encryption of a binary message, although as well as providing addition (XOR) 
and multiplication (AND), there are full and half adders provided offering carry in and 
carry out or just carry out, respectively. A bootstrapping routine is also provided. There 


Another low level implementation, ‘HELib’ (Halevi and Shoup, 2014b), provides a
C++ library implementing Brakerski et al. (2012), one of the early second generation
of schemes (see Appendix A). It incorporates some very useful optimisations, including
the work of Smart and Vercauteren (2014), which enables single-instruction multiple-data
(SIMD) parallelism by packing multiple values in a single cipher text. This is under active
development at the time of writing and appears the most comprehensive implementation
of a modern scheme currently available. Details of the algorithms used are available in
preprint (Halevi and Shoup, 2014a).


Finally, there was a recent comparison of two schemes, Fan and Vercauteren (2012)
and Bos et al. (2013), in Lepoint and Naehrig (2014) which provided the C++ software
used (Lepoint, 2014). Although not in the explicit form of a library it could be possible
to transform this into a C++ library for the two schemes.


4 HomomorphicEncryption R package 

For statistics researchers to be able to use homomorphic encryption techniques, an easy
to use yet high performance library in a high level language which is popular in the
community is necessary. An R language (R Core Team, 2014) package providing such an
implementation is a contribution of our work.

The HomomorphicEncryption R package (Aslett, 2014) provides an easy to use interface
to begin developing and testing statistical methods in a homomorphic environment.
The package has been developed to be extensible, so that as new schemes are researched
by cryptographers they can be made available for use by statistics researchers with minimal
additional effort. The package has a small number of generic functions for which
different cryptographic backends can be used. The underlying implementation is mostly
in high performance C and C++ (Eddelbuettel et al., 2011), with many of the operations
set up to utilise multi-core parallelism via multithreading (Allaire et al., 2014) without
requiring any end-user intervention.

The first generic cryptographic function is pars. The first argument to this function
designates which cryptographic backend to use and allows the user to override any of the
default parameters of that scheme (for example, $d$, $q$, $t$ and $\sigma$ of Section 2.3). Related to
this, there is the alternative method of specifying parameters via the function parsHelp.
This allows users to instead specify a desired minimal security level in bits and a minimal
depth of multiplications required, and then computes values for $d$, $q$, $t$ and $\sigma$ which will
satisfy these requirements with high probability, by automatically optimising established
bounds from the literature (Lepoint and Naehrig, 2014; Lindner and Peikert, 2011).

The second generic cryptographic function is keygen, whose sole argument is a parameter
object as returned by pars or parsHelp. keygen then generates a list containing
public ($pk) and private ($sk) keys, along with any scheme dependent keys (such as
relinearisation keys in the case of Fan and Vercauteren (2012)), which correspond to the
homomorphic scheme designated by the parameter object. At this point, the parameter
object is absorbed into the keys so that it doesn’t need to be used for any other functions.

The third generic cryptographic function is enc. This requires simply the public 
key (as returned in the $pk list element from keygen) and the integer message to be 
encrypted. It then returns a cipher text encrypted under the scheme to which the public 
key corresponds. Crucially, the ease of use begins to become very apparent here, with 
enc overloaded to enable encryption of not just individual integers, but also vectors 
and matrices of integers defined in R. The structure of the vectors and matrices is
preserved and the encryption process is fully multithreaded across all available CPU
cores automatically. 

The final generic cryptographic function is dec. Similarly, this requires simply the
private key, as returned in the $sk list element from keygen, and the (scalar/vector/matrix)
cipher text to be decrypted. It then returns the original message. Note that the structure
of vector or matrix cipher texts is correctly preserved throughout.

The real simplicity becomes evident when manipulating the cipher texts. All the
standard arithmetic functions (+, -, *) work as expected, implementing for example the
cyclotomic polynomial ring algebra of the FandV scheme transparently. Moreover, vectors
can be formed in the usual R manner using c (or extracted from the diagonal of matrix
cipher texts with diag), element wise arithmetic can be performed on those vectors (with
automatic multithreaded parallelism) and there is support for all the standard vector
functions, such as length, sum, prod and %*% for inner products, just as one would
conventionally use with unencrypted vectors in R. Indeed, such functionality extends to
matrices, with formation of diagonal matrices via diag from cipher text vectors, element
wise arithmetic and full matrix multiplication using the usual R operator %*% (again,
automatically fully parallelised). Matrices also support the usual matrix functions (dim,
length, t, etc). The package automatically dispatches these operations to the correct
backend cryptographic routines to perform the corresponding cipher text space operations
transparently, returning cipher text result objects which can be used in further operations
or decrypted.

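The following is the simplest possible instructive example. Examining the contents of
k, c1, etc. will show the encryption detail:

    library(HomomorphicEncryption)

    # Default parameters for the Fan and Vercauteren scheme, then a key pair
    p <- pars("FandV")
    k <- keygen(p)

    # Encrypt two integer vectors under the public key
    c1 <- enc(k$pk, c(42, 34))
    c2 <- enc(k$pk, c(7, 5))

    # Element wise arithmetic directly on the cipher texts
    cres1 <- c1 + c2
    cres2 <- c1 * c2
    cres1[1]

    # Decrypt with the private key
    dec(k$sk, cres1)
    dec(k$sk, cres2)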


Note that indexing into vectors and matrices as provided by R via the usual [] 
notation is fully supported, including assignment. 

We hope this provides a distinctly easy-to-use software implementation in arguably the 
most popular high level language in use among data scientists today, including automatic 
help for encryption scheme parameter selection to aid non-cryptographers. Moreover, 
given the computational burden of homomorphic schemes, the transparent multithreaded 
parallelism automatically across all CPU cores in all available scenarios (encryption, 
decryption and arithmetic with vectors/matrices) enables focus to be on the subject 
matter questions. 


At present, the scheme of Fan and Vercauteren (2012) (described in Section 2.3) has
been implemented, making use of FLINT (Hart, 2010) for certain polynomial operations
and GMP (Granlund and the GMP development team, 2012) for high performance arbitrary
precision arithmetic. Backends for further homomorphic encryption schemes may
be added in the future.


Table 1 provides indicative timings for common operations using the default parameters
of the package (which match the default parameters suggested in Fan and Vercauteren,
2012).


Table 1: Timings (in seconds; average of 100 repetitions) for operations
on cipher texts using the HomomorphicEncryption package.
All timings performed on an Amazon EC2 c4.8xlarge instance for
reproducibility. S represents a scalar, V a vector of size 100 and M
a matrix of size 10 x 10.


    scalar operations      vector operations      matrix operations

    S+S       0.003        V+V        0.58        M+M         0.87
    S*S       0.084        V*V        1.59        M*M         8.49
                           V%*%V      1.59        M%*%M      10.21


5 Conclusions


This technical report has provided a review of homomorphic encryption with a focus 
on issues which are pertinent to statisticians and machine learners. It also introduces 
the HomomorphicEncryption R package and demonstrates the ease of getting started 
experimenting with homomorphic encryption. 

The practical limitations of homomorphic encryption schemes mean that existing
techniques cannot always be directly translated into a corresponding secure algorithm.
This presents an opportunity for the statistics and machine learning community to engage
with research in privacy preserving methods by developing new methods which are
tailored to homomorphic computation and which work within the constraints described
in Section 2.3, with the sister paper to this review (Aslett, Esperança and Holmes, 2015)
being an initial contribution in this direction.


A Modern homomorphic schemes


The groundbreaking work by Gentry (2009) set the stage for the modern era of homomorphic
schemes where both addition and multiplication to a (theoretically) arbitrary depth
are possible. In a nutshell, Gentry constructed a scheme based on ideal lattices over
a polynomial ring which could perform sufficient homomorphic operations to evaluate a
so-called ‘squashed’ version of its own decryption algorithm: thus, given an encrypted
version of a hint about the secret key, evaluating the decryption homomorphically results
in a ‘fresh’ cipher text where the noise level is reset.

This quickly spawned many other schemes which invoked these techniques. Two
conceptually much simpler schemes using the technique and based on large integer cipher
texts were developed in van Dijk et al. (2010) and Smart and Vercauteren (2010). Stehlé
and Steinfeld (2010) directly improved on Gentry (2009), making evaluation of operations
less complex. Brakerski and Vaikuntanathan (2011b) used the Gentry approach, removing
some untested security assumptions which had been made. These works were in a sense
the ‘first generation’ of modern schemes.

Brakerski and Vaikuntanathan (2011a) triggered a second generation of schemes based
on the “learning with errors” (LWE) problem (Regev, 2009) which did not rely on the
poorly understood hardness assumptions of ideal lattices or ‘squashing’ of the decryption
circuit to achieve full homomorphism. Moreover, it ensured that the size of the public key
was independent of the depth of operations to be performed: implementations of Gentry’s
original scheme required up to 2.3 gigabyte public keys (Gentry and Halevi, 2011)! This
second generation of schemes includes Brakerski et al. (2012) which introduced ‘leveled’
schemes, where noise grows linearly; Brakerski (2012) which introduced scale-invariance,
reducing the number of keys that must be stored; Fan and Vercauteren (2012) which
provided a practical scheme, porting scale invariance to the Brakerski et al. (2012) scheme
and setting it in a ring-LWE context (Lyubashevsky et al., 2010); Gentry et al. (2013)
which introduced a highly novel LWE approach where cipher texts are matrices and
operations follow standard matrix arithmetic; and Brakerski and Vaikuntanathan (2014)
where they focus on matching security levels of non-homomorphic schemes, among others.


B Ring Learning With Errors (LWE) 


The ring LWE hardness result underlies the homomorphic encryption scheme reviewed in
Section 2.3. It is a ring based extension of the original LWE result due to Regev (2009).
For the interested reader this appendix provides a short simplified explanation of the
problem the security of the scheme relies upon. The notation here follows that of Section
2.3.


The original LWE problem requires reconstruction of a secret vector
$s = (s_1, \dots, s_n) \in \mathbb{Z}_q^n$, for some $q \in \mathbb{Z}$, when only in possession
of a collection of approximate random linear equations. First, imagine forming the results
of many linear equations, $z_j = \langle a_j, s \rangle$, by choosing uniformly random vectors
$a_j \sim \mathbb{Z}_q^n$. Then, given $n$ realisations of $\{z_j, a_j\}$ it is a
simple matter of solving a system of linear equations to recover $s$.

However, consider the approximate version of this problem: given a uniformly random
vector $a_j \sim \mathbb{Z}_q^n$, form instead the perturbed inner products
$y_j = \langle a_j, s \rangle + e_j$, where $e_j$ is a scalar discrete random Gaussian draw.
Then, given many realisations of $\{y_j, a_j\}$, the objective is to solve
$\langle a_j, x \rangle \approx y_j$ for $x$. For appropriate choices of the error this can be
shown to be an exceptionally hard problem: certainly as hard as traditional worst-case
lattice problems which have been well studied.
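
The contrast between the exact and perturbed problems can be sketched with toy
parameters. The modulus, dimension and error scale below are illustrative assumptions
far too small to offer any security, and rounding a continuous Gaussian draw is only a
simple stand-in for a proper discrete Gaussian sampler.

```python
import random

Q = 97   # small prime modulus (real schemes use far larger parameters)
N = 4    # length of the secret vector


def inner(a, s):
    """Inner product of two vectors modulo Q."""
    return sum(x * y for x, y in zip(a, s)) % Q


def solve_mod_q(rows, rhs):
    """Gauss-Jordan elimination modulo the prime Q; None if singular."""
    n = len(rows)
    m = [list(r) + [b] for r, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % Q), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        inv = pow(m[col][col], -1, Q)  # modular inverse (Python 3.8+)
        m[col] = [v * inv % Q for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [(v - f * w) % Q for v, w in zip(m[r], m[col])]
    return [row[n] for row in m]


rng = random.Random(1)
s = [rng.randrange(Q) for _ in range(N)]

# Exact equations: N samples recover s (redraw in the rare singular case).
recovered = None
while recovered is None:
    A = [[rng.randrange(Q) for _ in range(N)] for _ in range(N)]
    recovered = solve_mod_q(A, [inner(a, s) for a in A])
print(recovered == s)  # True

# Perturbed equations: the same solver is thrown off by any non-zero error.
e = [round(rng.gauss(0, 2)) for _ in range(N)]
noisy = solve_mod_q(A, [(inner(a, s) + err) % Q for a, err in zip(A, e)])
print(noisy == s)
```

With exact equations, linear algebra recovers the secret immediately; once errors are
added, the same elimination amplifies them and returns an unrelated vector unless
every error happens to be zero.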

Ring LWE (Lyubashevsky et al., 2010) ports the same results to the more complex
polynomial ring setting, but the formulation is essentially unchanged in that it is now
simply solution of a system of perturbed linear equations in an algebraic ring.

Notice that the public key in Section 2.3 is precisely the ring LWE problem: the public
key contains a masked version of the secret key, with the security of doing this based on
the difficulty of recovering it due to the ring LWE problem hardness.







